Introduction

This workshop introduces the tidyverse packages, which support data processing and visualisation in R. All of these packages share the same concepts of data handling, which makes it easier to build workflows that move between different kinds of task.

This workshop assumes a basic familiarity with the R language, as covered in the introductory R Programming workshop.

Getting started in RStudio

  • Bottom left: console window (type commands here and see the output directly)
  • Top left: editor window (for editing R notebooks and scripts)
  • Top right: workspace / history window (examine the current workspace, import datasets, see commands entered previously)
  • Bottom right: files / plots / packages / help window (change working directory, install packages, see graphics output, browse help)

R notebooks

This document is an example of an R notebook, which combines text and code and makes it easy to embed R analysis within a report (output as HTML, PDF, Word document or presentation slides). This can help to make research more reproducible by allowing you to share an entire analysis workflow together with a narrative.

When you open the notebook’s source code (.Rmd file) in RStudio, you can view and edit it in the editor window.

Text is formatted using the R markdown notation, which is derived from Markdown. This is a simple way to apply styling to text and indicate the structure of your document.
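For example, a few common constructs look like this in the .Rmd source:

```
## A second-level heading

Some text with *italics*, **bold** and `inline code`.

  * a bullet point
  * another bullet point
```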

R code is included as chunks, which look like this:

## R code lives here
print("Hello RStudio!")
## [1] "Hello RStudio!"

When the cursor is inside a chunk, you can execute the code using Ctrl+Shift+Enter. The commands and output appear in the console window as if the chunk had been copy-pasted there. The output also appears in the notebook just after the chunk.

A notebook is a living document. You are encouraged to make use of this notebook to try out the example code, alter it, complete the exercises and add your own notes and code chunks. You can insert a new code chunk using the shortcut Ctrl+Alt+I.

Working directory

Before you start work, please check your working directory:

getwd()

You can change the working directory using the files window in RStudio (bottom right). Navigate to the directory containing the workshop data and click the cog icon for the “Set as Working Directory” option.
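You can also set the working directory from the console with setwd(). A small sketch, using a temporary directory so that it runs anywhere (in the workshop you would use the folder containing the data):

```r
old <- getwd()      # remember where we started
setwd(tempdir())    # switch to a temporary directory (illustrative only)
getwd()             # confirm the change
setwd(old)          # switch back
```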

About packages

Many useful R functions are developed by individuals and research groups and made available to the community as packages.

You can find a full list of R packages at the CRAN page.

Platforms such as RStudio and Anaconda now make it easy to install these published packages and their dependencies. The tidyverse packages are already installed for this workshop, so we just need to load them into the workspace:

library(tidyverse)
## Registered S3 methods overwritten by 'ggplot2':
##   method         from 
##   [.quosures     rlang
##   c.quosures     rlang
##   print.quosures rlang
## Registered S3 method overwritten by 'rvest':
##   method            from
##   read_xml.response xml2
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.1       ✔ purrr   0.3.2  
## ✔ tibble  2.1.1       ✔ dplyr   0.8.0.1
## ✔ tidyr   0.8.3       ✔ stringr 1.4.0  
## ✔ readr   1.3.1       ✔ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Note that this command loads all of the core tidyverse packages, but you can also load them individually as necessary.

The data

Today’s session will focus on data taken from the GapMinder project.

Loading data (readr)

Excel spreadsheets

To begin, we will use the readxl package to load data from the relevant sheets in the Excel workbook. This package is not part of the core tidyverse, so we need to load it directly:

library(readxl)

The command read_excel() reads both xls and xlsx files and detects the format from the extension. Today we will only use the sheet called list-of-countries-etc from this workbook.

read_excel("data_geographies_v1.xlsx", sheet = "list-of-countries-etc")

RStudio shows you the output as a table. Let’s capture the loaded data using the variable countries:

countries <- read_excel("data_geographies_v1.xlsx", sheet = "list-of-countries-etc")

We can get a quick overview of the data using the str() function:

str(countries)
## Classes 'tbl_df', 'tbl' and 'data.frame':    197 obs. of  11 variables:
##  $ geo                             : chr  "afg" "alb" "dza" "and" ...
##  $ name                            : chr  "Afghanistan" "Albania" "Algeria" "Andorra" ...
##  $ four_regions                    : chr  "asia" "europe" "africa" "europe" ...
##  $ eight_regions                   : chr  "asia_west" "europe_east" "africa_north" "europe_west" ...
##  $ six_regions                     : chr  "south_asia" "europe_central_asia" "middle_east_north_africa" "europe_central_asia" ...
##  $ members_oecd_g77                : chr  "g77" "others" "g77" "others" ...
##  $ Latitude                        : num  33 41 28 42.5 -12.5 ...
##  $ Longitude                       : num  66 20 3 1.52 18.5 ...
##  $ UN member since                 : POSIXct, format: "1946-11-19" "1955-12-14" ...
##  $ World bank region               : chr  "South Asia" "Europe & Central Asia" "Middle East & North Africa" "Europe & Central Asia" ...
##  $ World bank, 4 income groups 2017: chr  "Low income" "Upper middle income" "Upper middle income" "High income" ...

Tibbles

Notice that as well as the class data.frame, the object countries also belongs to the classes tbl_df and tbl. This shows that the data has been loaded in the form of a tibble.

Tibbles are part of the tidyverse architecture. They are like data frames, but they do less (e.g., they don’t attempt to coerce variables into different types) and complain more (e.g., when a variable does not exist). The idea is to force the programmer to deal with issues earlier, and so make it harder to write confusing or broken R code.
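A small illustration of "complain more", using a column that doesn't exist:

```r
library(tibble)

df <- data.frame(x = 1:3)
tb <- tibble(x = 1:3)

df$y   # a base data.frame silently returns NULL
tb$y   # a tibble also returns NULL, but warns about the unknown column `y`
```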

CSV files

Let’s look at another, related data set, this time loaded from a CSV file using read_csv() from the readr package:

co2 <- read_csv("yearly_co2_emissions_1000_tonnes.csv")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   country = col_character()
## )
## See spec(...) for full column specifications.
co2

This is a straightforward numerical tibble which contains a lot of missing values. The table shows annual CO2 emissions from burning fossil fuels for each country.

NB read_csv() is a different function to read.csv() from base R (which outputs a base data.frame rather than a tbl_df).

Tabular data in other formats (e.g. with tabs as delimiters, or using fixed-width fields) can also be read using other functions from readr.
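For example, read_tsv() handles tab-delimited text and read_fwf() handles fixed-width fields. A quick sketch, writing a small temporary file so the example is self-contained:

```r
library(readr)

# write a tiny tab-separated file (temporary and illustrative)
tsv_file <- tempfile(fileext = ".tsv")
writeLines(c("name\tvalue", "foo\t1", "bar\t2"), tsv_file)

read_tsv(tsv_file)
# read_fwf() works similarly, with column positions described by
# helpers such as fwf_widths()
```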

Tidying data (tidyr)

The tidyverse packages put a lot of emphasis on working with tidy data. What do we mean by that?

Tidy data has the three following attributes:

  1. Every column is a variable
  2. Every row is an observation (also known as a “case”)
  3. Every cell holds a single value

When data is tidy, we can visualise and analyse it more easily.

However, most of the data tables that you encounter “in the wild” will not be tidy by this definition, so the tidyr package provides functions to help reshape them into a tidy form.

Look at the co2 tibble. What are the observations and what are the variables?

gather()

We can use the gather() function to tidy the co2 tibble:

co2 <- gather(co2, -country, key=year, value=kt, na.rm=TRUE)
co2

gather() works to lengthen the data table by collecting observations from multiple columns.

We specify the columns to use (all columns except the country name) and provide the names of two new variables, one to hold the old variable names (“year”) and one to hold the observations collected (“kt”).

Notice that the tidyverse functions usually allow you to omit quotes when referring to column names.

By setting na.rm=TRUE, we discard the cells that did not contain observations.

Changing data type

One complication here is that the year variable is shown as having a character data type. This is because the years have been derived from column names (strings) in the previous version of the table. Let’s fix this before going any further:

co2$year <- parse_integer(co2$year)
co2

CSV without headers

Now that co2 is in a tidy form, let’s look at another example. 1997_stats.csv is a CSV file containing GDP and population for various countries for the year 1997.

This file is missing the header row that read_csv() expects, so it is not quite in the standard CSV format. You can open it in Excel to verify this. However, we can still load it using read_csv() as follows:

stats97 <- read_csv("1997_stats.csv", col_names=FALSE)
## Parsed with column specification:
## cols(
##   X1 = col_character(),
##   X2 = col_double()
## )
stats97

You can see that the two columns in this file have been given the names X1 and X2.

This file looks a bit harder to tidy. What are the variables in this data set and what are the observations?

separate()

First, we need to split the X1 data into two columns: one for the three-letter country code (the variable named geo in our countries tibble) and one for the type of measurement (GDP or population). We can do this with the function separate():

stats97 <- separate(stats97, col=X1, sep="-", into=c("geo","measurement"))
stats97

Take a look at the command above. The arguments to separate() give the tibble to be processed, the name of the column to be split, the string to use as the delimiter, and a vector of strings giving the names for the resulting columns.

spread()

Now we need to separate the GDP and population values into two different columns. This is done using the spread() function:

stats97 <- spread(stats97, key=measurement, value=X2) 
stats97

Notice that there are fewer rows than before; the spread() operation shortens the table by spreading the values across new columns.

Now each column is a variable, each row is an observation, and each cell is a single value, so we have successfully tidied stats97.

The readr/tidyr cheat sheet gives a useful summary of the tidyverse functions for loading and tidying data.

Visualising data (ggplot2)

Having loaded some numerical data, a sensible first step is to visualise the distributions of variables to check for any issues.

Although there are many useful plotting functions available in base R, in this session we will focus on making plots using the tidyverse package ggplot2. This is a very powerful set of functions that support a consistent approach to data visualisation. ggplot2 makes it much easier to create high-quality graphics for presentations and publications.

Histogram

Let’s start with a histogram for the GDP data from the stats97 tibble:

stats97

Some of the GDP values are missing, so let’s remove those rows as a first step, using drop_na() from tidyr:

gdp97 <- drop_na(stats97, gdp)
gdp97
a <- ggplot(gdp97, aes(x=gdp))
hist <- a + geom_histogram(bins=100)
hist

Notice that there are two steps to creating the plot.

The first line constructs the data space by specifying the dataset (gdp97) and the variable(s) of interest (in this case, just gdp). The aes() function controls the aesthetic mappings used by ggplot2, i.e. the way in which the data will be mapped to visual elements.

In the second step, we add in a histogram geom that paints in the histogram bars and constructs the y-axis. Geoms are the actual visual marks that encode the data. ggplot2 provides geoms for all the commonly-used plot types, but also graphical primitives such as curves, lines and polygons, from which other visualisations can be built.

The ggplot2 defaults for axes, labels etc. are usually informative enough for data exploration, but of course everything can be modified if necessary. For example, to add a title, change the histogram colour and show GDP on a log scale:

hist <- a + geom_histogram(bins=20, fill="blue", alpha=0.5) + labs(x="GDP", title="Countries in 1997") + scale_x_log10()
hist

Saving plots to file

We can save the last plot made to a file using

ggsave("my_histogram.pdf")
## Saving 7 x 5 in image

The file type is determined by the file extension and the size of the image can be changed using width and height options.
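For example (using a built-in dataset and a temporary filename so the sketch is self-contained; in practice you would give a fixed filename):

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = mpg)) + geom_histogram(bins = 10)  # any plot object

outfile <- tempfile(fileext = ".png")             # illustrative filename
ggsave(outfile, plot = p, width = 10, height = 6) # 10 x 6 inches; PNG from the extension
```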

Violin plot

The same data can be plotted using the geom_violin() geom. This time we will map gdp to the y coordinate:

ggplot(gdp97, aes(y=gdp, x="")) + geom_violin(fill="blue", alpha=0.5, linetype=0) + scale_y_log10()

Scatter plot

We can visualise covariation between variables using a scatter plot, for example GDP vs population. This uses geom_point():

ggplot(gdp97, aes(x=pop, y=gdp)) + 
  geom_point() + 
  scale_x_log10() + 
  scale_y_log10() 

The ggplot2 cheat sheet gives much more information about plotting options.

Manipulating data (dplyr)

The dplyr package deals with many types of data manipulation that are needed as part of any analysis workflow.

mutate()

It might be more useful to compare countries’ GDP on a per-capita basis. We need to make a new variable to show per-capita GDP. To do this, we will use the mutate() function, which adds new columns to the tibble:

gdp97 <- mutate(gdp97, gdp_pp=gdp/pop)
gdp97

Note that like all dplyr functions (and despite its name), the mutate() function does not actually change the original tibble used as input, so we still need to capture the output using <-.
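A quick illustration with a toy tibble:

```r
library(dplyr)

tb <- tibble(x = 1:3)
mutate(tb, y = x * 2)   # prints a new tibble with columns x and y
tb                      # the original is untouched: still only column x
```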

Exercise

Visualise the distribution of the new variable gdp_pp.

##### SOLUTION

ggplot(gdp97, aes(x=gdp_pp)) + geom_histogram(bins=20, fill="blue", alpha=0.5) + labs(x="per-capita GDP", title="Countries in 1997")

#####

Selecting and filtering data

filter()

With tidy data, it is easy to filter cases according to whatever conditions we specify, e.g.:

filter(gdp97, gdp_pp > 30000, pop < 1000000)

select()

We can extract a subset of variables using select(), e.g.:

select(gdp97, c(geo,gdp_pp))

pull()

If you just want the values from a single column, use pull() to extract them as a vector:

pull(gdp97, geo)
##   [1] "ago" "alb" "and" "are" "arg" "arm" "atg" "aus" "aut" "aze" "bdi"
##  [12] "bel" "ben" "bfa" "bgd" "bgr" "bhr" "bhs" "bih" "blr" "blz" "bol"
##  [23] "bra" "brb" "brn" "btn" "bwa" "caf" "can" "che" "chl" "chn" "civ"
##  [34] "cmr" "cod" "cog" "col" "com" "cpv" "cri" "cub" "cyp" "cze" "deu"
##  [45] "dma" "dnk" "dom" "dza" "ecu" "egy" "eri" "esp" "est" "eth" "fin"
##  [56] "fji" "fra" "fsm" "gab" "gbr" "geo" "gha" "gin" "gmb" "gnb" "gnq"
##  [67] "grc" "grd" "gtm" "guy" "hnd" "hrv" "hti" "hun" "idn" "ind" "irl"
##  [78] "irn" "irq" "isl" "isr" "ita" "jam" "jor" "jpn" "kaz" "ken" "kgz"
##  [89] "khm" "kir" "kna" "kor" "kwt" "lao" "lbn" "lbr" "lca" "lka" "lso"
## [100] "ltu" "lux" "lva" "mar" "mda" "mdg" "mdv" "mex" "mhl" "mkd" "mli"
## [111] "mlt" "mmr" "mne" "mng" "moz" "mrt" "mus" "mwi" "mys" "nam" "ner"
## [122] "nga" "nic" "nld" "nor" "npl" "nzl" "omn" "pak" "pan" "per" "phl"
## [133] "png" "pol" "prt" "pry" "pse" "rou" "rus" "rwa" "sau" "sdn" "sen"
## [144] "sgp" "slb" "sle" "slv" "smr" "srb" "sur" "svk" "svn" "swe" "swz"
## [155] "syc" "tcd" "tgo" "tha" "tjk" "tkm" "ton" "tto" "tun" "tur" "tuv"
## [166] "tza" "uga" "ukr" "ury" "usa" "uzb" "vct" "ven" "vnm" "vut" "wsm"
## [177] "yem" "zaf" "zmb" "zwe"

arrange()

Use arrange() to sort a tibble by ascending column value:

arrange(gdp97, pop)

…or by descending value using desc():

arrange(gdp97, desc(pop))

The dplyr cheat sheet gives useful summaries of these and other functions for data manipulation.

Exercise

Use manipulations of the countries tibble to complete the following tasks:

  1. Find the Asian countries that are south of the equator.
##### SOLUTION

filter(countries, four_regions=="asia", Latitude<0)
#####
  2. Find the first five African countries to join the UN.
##### SOLUTION

dat <- filter(countries, four_regions=="africa")
dat <- arrange(dat, `UN member since`)
head(dat, 5)
#####
  3. Make a vector of OECD country names, sorted from East to West.
##### SOLUTION

dat <- filter(countries, members_oecd_g77=="oecd")
dat <- arrange(dat,desc(Longitude))
pull(dat, name)
##  [1] "New Zealand"     "Japan"           "Australia"      
##  [4] "South Korea"     "Turkey"          "Finland"        
##  [7] "Greece"          "Hungary"         "Poland"         
## [10] "Slovak Republic" "Czech Republic"  "Sweden"         
## [13] "Austria"         "Italy"           "Germany"        
## [16] "Denmark"         "Norway"          "Switzerland"    
## [19] "Luxembourg"      "Netherlands"     "Belgium"        
## [22] "France"          "United Kingdom"  "Spain"          
## [25] "Ireland"         "Portugal"        "Iceland"        
## [28] "United States"   "Mexico"          "Canada"
#####

Exercise

Starting with the co2 tibble, plot the annual emissions of a country of your choice. Hint: use the geom_line() geom.

##### SOLUTION

dat <- filter(co2, country=="Sweden")
ggplot(dat, aes(x=year, y=kt)) + geom_line()

#####

Joining tables

To compare emissions between countries in a fair way, it would make sense to convert them to a per-capita basis. Let’s start with the figures for 1997 to see how this can be done.

First we will make a new tibble containing only the 1997 emissions:

co2_1997 <- filter(co2, year==1997)
co2_1997

However, the population data is not yet in the co2 tibble, so we will need to look it up from another tibble by matching the country name. This type of relational data, where information must be collected from multiple tables, requires careful handling to make sure that rows in different tables are correctly associated with each other. The country name acts as a key to unlock the correct data from the associated table.

The relevant population data is in the stats97 table:

stats97

However, this is indexed by the geo code, rather than the country name that we find in co2_1997. Fortunately, the countries tibble contains both:

countries

Taking the co2_1997 data, we apply a left_join() to connect its country variable to the name variable in countries:

co2_1997 <- left_join(co2_1997, countries, by=c("country"="name"))
co2_1997

For every row in the original table, left_join() tries to match its country with a name in countries. The resulting table imports the additional columns from the countries tibble, so now we can associate each geo code with the correct CO2 emissions.

left_join() is just one of several dplyr functions for working with relational data. You can read more about relational data in the R for Data Science online textbook.

Exercise

Use another left_join() to connect co2_1997 to stats97. Hint: dplyr will automatically match keys where the variable names are the same, so this join is a little easier than the last one.

##### SOLUTION

co2_1997 <- left_join(co2_1997,stats97)
## Joining, by = "geo"
co2_1997
#####
Exercise

Calculate the per-capita emissions for 1997 as a new column and plot these on a histogram.

##### SOLUTION

co2_1997 <- mutate(co2_1997, kt_pp = kt/pop)
co2_1997
ggplot(co2_1997,aes(x=kt_pp)) + geom_histogram(bins=20) + scale_x_log10()

#####

Pipes

Often, the operations we want to apply to data require several steps. Because R is organised around functions, we can sometimes get tied up with nested parentheses (), or confused by a lot of intermediate variables.

One solution for more understandable code is to make use of pipes (from the magrittr package) to chain functions together.

The pipe operator is %>%.

We can think of x %>% f as having the same meaning as f(x).

A simple example:

1:10 %>% mean
## [1] 5.5

We can also pipe the output to a new function:

1:10 %>% 
  mean %>%
  log
## [1] 1.704748

To capture the result of a chain of pipes, use variable assignment with the arrow operator as normal:

x <- 1:10 %>% 
  mean %>%
  log

x
## [1] 1.704748

By default, the piped value is used as the first argument of the function to which it is directed.

Any further positional arguments that are given to the function will be associated with the second argument onwards:

1:10 %>%
  sample(5)  # sample(x, s) returns a random sample (without replacement) of size s from the vector x
## [1] 2 8 9 5 6

However, for functions that take more than one argument, sometimes we need to refer to the piped value directly using .:

1:10 %>% 
  mean %>%
  log %>%
  rnorm(10, ., 1)  # rnorm(n, mu, sd) returns a vector of n random samples from a normal distribution with mean mu and standard deviation sd
##  [1] 0.8347929 2.8742657 2.8988907 0.9907220 1.9649831 0.8809225 0.7244809
##  [8] 1.3000776 3.2714297 0.2171248

Exercise

Starting with co2, recreate the 1997 per-capita emissions histogram, this time using pipes instead of overwriting variables.

co2 %>%

##### SOLUTION

  filter(year==1997) %>%                                                # filter co2 by year   
  left_join(countries, by=c("country"="name")) %>%                      # join countries
  left_join(stats97) %>%                                                # join stats97
  mutate(kt_pp = kt/pop) %>%                                            # calculate per-capita emissions
  ggplot(aes(x=kt_pp)) + geom_histogram(bins=20) + scale_x_log10()      # make the histogram
## Joining, by = "geo"

#####

Exercise

The file population_total.csv contains (real or predicted) population data for each country for the years 1800-2100.

Write a workflow to construct a tibble co2_pp containing the following columns:

  • country
  • year
  • kt = total CO2 emissions (in kilotonnes)
  • pop = total population
  • t_pp = per-capita CO2 emissions (in tonnes)
##### SOLUTION

population <- read_csv("population_total.csv") %>%        # load the population data
  gather(-country, key=year, value=pop, na.rm=TRUE) %>%   # tidy the data
  mutate(year=parse_integer(year))                        # convert strings to numbers
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   country = col_character()
## )
## See spec(...) for full column specifications.
co2_pp <- co2 %>%
  left_join(population) %>%                               # join on (country, year)
  drop_na(pop) %>%                                        # drop rows with no population data
  mutate(t_pp = 1000*kt/pop)                              # calculate per-capita emissions in tonnes
## Joining, by = c("country", "year")
#####

Summarising data across groups

Cases often belong to distinct groups that we want to compare with each other in some way.

Exercise

Using the output of the previous exercise and the countries tibble, add columns for the geo code and the eight_regions grouping.

##### SOLUTION

co2_pp <- countries %>%
  select(c(name, geo, eight_regions)) %>%        # select the desired columns
  left_join(co2_pp, ., by=c("country"="name"))   # join to co2_pp

#####

Box plots

Let’s look at the data for 2014 only. Here’s a more complex visualisation of the data, made by adjusting the data mappings in aes():

co2_pp %>%
  filter(year==2014) %>%                                  # filter by year
  ggplot(aes(x=eight_regions,                             # map eight_regions to x-axis
             y=t_pp,                                      # map t_pp to y-axis
             fill=eight_regions)) +                       # map eight_regions to fill colour
    geom_boxplot(alpha=0.5) +                             # make boxplots
    scale_fill_brewer(palette="Set1") +                   # choose the colour palette
    labs(x=NULL,                                          # remove the x-axis label
         y="CO2 emissions per capita / tonnes",           # change the y-axis label
         title="2014") +                                  # add a title
    guides(fill="none") +                                 # remove the legend
    scale_y_log10() +                                     # use a log scale
    coord_flip()                                          # transpose the coordinates x <-> y

group_by() and summarise()

Let’s extract the corresponding summary statistics for each group:

co2_pp %>%
  filter(year==2014) %>%
  group_by(eight_regions) %>%
  summarise(min=min(t_pp), 
            q25=quantile(t_pp,0.25), 
            median=median(t_pp), 
            q75=quantile(t_pp,0.75), 
            max=max(t_pp)) %>%
  arrange(desc(median))

dplyr provides the group_by() function to create a “grouped” version of the tibble. Manipulations on the grouped data will be applied to each group separately and then combined.

Here, the function summarise() is used to make new variables for the summary statistics. The resulting tibble has one row for each of the groups.

Exercise

Plot the total global CO2 emissions for each year.

##### SOLUTION 

co2_pp %>%
  group_by(year) %>%
  summarise(total=sum(kt)) %>%
  ggplot(aes(x=year,y=total)) + 
  geom_line()

#####

group_by() is a particularly powerful feature, because it allows you to group by more than one variable at a time. For example:

co2_pp %>%
  group_by(eight_regions, year) %>%
  summarise(median_tonnes_pp=median(t_pp))

Exercise

Plot the yearly median per-capita CO2 emissions for the eight regions.

Hint: to map a variable (e.g. xx) to line colour, use aes(color=xx).

##### SOLUTION 

co2_pp %>%
  group_by(eight_regions, year) %>%
  summarise(median=median(t_pp)) %>%
  ggplot(aes(x=year, y=median, color=eight_regions)) + 
    geom_line() + 
    scale_color_brewer(palette="Set1")

#####

Statistical testing

The Kyoto Protocol committing states to reduce greenhouse gas emissions was signed in December 1997. Between then and 2014, is there any evidence of individual countries making an effort to reduce CO2 output?

To begin, let’s consider the per capita CO2 emissions for europe_west in 1997 and 2014:

co2_pp %>%
  filter(eight_regions=="europe_west", year %in% c(1997,2014)) %>%
  group_by(year) %>%
  summarise(mean=mean(t_pp), sd=sd(t_pp))

So there is a reduction in the mean, but also substantial variation in both years.

Let’s compare the distributions visually with overlaid histograms:

co2_pp %>%
  filter(eight_regions=="europe_west",                           # filter by eight_regions
         year %in% c(1997,2014)) %>%                             # filter by year
  mutate(year=factor(year)) %>%                                  # convert year to a factor for grouping
  ggplot(aes(x=t_pp, fill=year)) +                               # map year to fill colour
    geom_histogram(bins=15, alpha=0.4, position="identity")      # make overlaid histograms

A reduction looks plausible, but is there a statistically significant change in the mean?

Because the data are positively skewed, let’s work with log10(t_pp) to get closer to a normal distribution:

co2_pp %>%
  filter(eight_regions=="europe_west",     
         year %in% c(1997,2014)) %>%                            
  mutate(year=factor(year)) %>%                                 
  ggplot(aes(x=log10(t_pp), fill=year)) +                         # now using log10(t_pp)
    geom_histogram(bins=15, alpha=0.4, position="identity")     

We will use a paired sample t-test to test the hypothesis

H1: The mean change in log10(t_pp) from 1997 to 2014 is negative.

against the null hypothesis

H0: The mean change in log10(t_pp) from 1997 to 2014 is zero or positive.

This hypothesis test is one-tailed, because we are looking at evidence for a reduction in emissions.

We use the paired version of the t-test because an individual country’s value in 2014 clearly depends on its value in 1997.

We will use a significance level of 5%.

First we construct a tibble with columns for the data vectors that we need for the test:

dat <- co2_pp %>%
  filter(eight_regions=="europe_west",         # filter by eight_regions
         year %in% c(1997,2014)) %>%           # filter by year
  select(country,year,t_pp) %>%                # only using these variables
  spread(year,t_pp) %>%                        # reshape the data
  drop_na()                                    # drop rows with missing data

Note that this is no longer a tidy tibble, but that’s OK - we are just using it as an intermediate step towards the test.

Next we use the t.test() function to obtain a p-value.

For paired samples, it is important that the vectors are of the same length and in the same order. This has already been ensured because:

  1. the data are taken from a tibble containing the country names, so the values for 1997 and 2014 (taken from the tibble columns) are ordered in the same way.
  2. we have used drop_na() to remove any rows with a missing value.
res <- t.test(dat[["2014"]], dat[["1997"]], 
              paired=TRUE, 
              alternative = "less")          # H1: 2014 < 1997

print(res$p.value)
## [1] 1.634044e-06

So p < 0.05 and hence there is substantial evidence to reject H0 for this group of countries.

Exercise

Which countries in europe_west made the biggest per-capita reductions in CO2 emissions between 1997 and 2014? Hint: start with the reshaped tibble (dat) made above.

##### SOLUTION

dat %>%
  mutate(change = `2014` - `1997`) %>%
  arrange(change)
#####

Iteration

How can we repeat the t-test for the other groups?

Firstly, we need to get hold of the group names as a vector:

# convert column to a factor to extract the levels
groups <- levels(factor(co2_pp$eight_regions))

for() loop

One solution would be to adapt the earlier code into a for() loop:

for(g in groups) {
  
  dat <- co2_pp %>%
    filter(eight_regions==g,                     # filter by eight_regions
         year %in% c(1997,2014)) %>%           
    select(country,year,t_pp) %>%                
    spread(year,t_pp) %>%                        
    drop_na()                                    

  res <- t.test(dat[["1997"]], dat[["2014"]], 
              paired=TRUE, 
              alternative = "greater")        

  print( str_c(g, ":  p =", res$p.value) )       # str_c() concatenates strings
  
}
## [1] "africa_north:  p =0.999559227072308"
## [1] "africa_sub_saharan:  p =0.982843848654168"
## [1] "america_north:  p =0.869394149030891"
## [1] "america_south:  p =0.988422919341269"
## [1] "asia_west:  p =0.574257095673869"
## [1] "east_asia_pacific:  p =0.785473484507712"
## [1] "europe_east:  p =0.277596780928962"
## [1] "europe_west:  p =1.63404425150583e-06"

This is fine if we only need to look at the p-values, but things will get complicated if we want to collect them as a vector (for example, in order to correct for multiple hypothesis testing).

map()

A more elegant solution would be to use one of the map() functions from the purrr package.

map(x, f) takes a list (or vector) x and a function f, and passes each element of x to f in turn. The output of map() is a list (the same length as x) containing the individual outputs of f:

f <- function(x) {
  return(x + 100)
}

map(1:5, f)
## [[1]]
## [1] 101
## 
## [[2]]
## [1] 102
## 
## [[3]]
## [1] 103
## 
## [[4]]
## [1] 104
## 
## [[5]]
## [1] 105

Equivalently, using an anonymous function:

map(1:5, function(x) return(x + 100))
## [[1]]
## [1] 101
## 
## [[2]]
## [1] 102
## 
## [[3]]
## [1] 103
## 
## [[4]]
## [1] 104
## 
## [[5]]
## [1] 105

map() also supports shortcuts in the form of one-sided formulas, which can allow more compact code:

map(1:5, ~ .x + 100)
## [[1]]
## [1] 101
## 
## [[2]]
## [1] 102
## 
## [[3]]
## [1] 103
## 
## [[4]]
## [1] 104
## 
## [[5]]
## [1] 105

For convenience and code optimisation, there are versions of map() that return vectors instead of lists. For example, map_dbl() returns a vector of numbers:

map_dbl(1:5, ~ .x + 100)
## [1] 101 102 103 104 105

We can use map_dbl() to produce a vector of p-values:

do_t_test <- function(g) {                     # our function to do one test
  
  dat <- co2_pp %>%
    filter(eight_regions==g,                   # filter by eight_regions
         year %in% c(1997,2014)) %>%           
    select(country,year,t_pp) %>%                
    spread(year,t_pp) %>%                        
    drop_na()                                    

  res <- t.test(dat[["1997"]], dat[["2014"]], 
              paired=TRUE, 
              alternative = "greater")        

  return(res$p.value)                          # function returns the p-value
  
}

pvals <- map_dbl(groups, do_t_test)    # map group names to the t-test function
names(pvals) <- groups                 # add the group names to the vector

pvals
##       africa_north africa_sub_saharan      america_north 
##       9.995592e-01       9.828438e-01       8.693941e-01 
##      america_south          asia_west  east_asia_pacific 
##       9.884229e-01       5.742571e-01       7.854735e-01 
##        europe_east        europe_west 
##       2.775968e-01       1.634044e-06

The map() functions are usually more concise than an equivalent for() loop, and they are much easier to understand than the equivalent apply functions from base R.
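For comparison, the same computation written with map_dbl() and with base R’s sapply():

```r
library(purrr)

map_dbl(1:3, ~ .x^2)             # purrr: type-stable, always returns a double vector
sapply(1:3, function(x) x^2)     # base R: the result type depends on the input
```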

walk()

walk(x, f) applies f to each element in the same way as map(), but returns the input x (invisibly) rather than collecting the outputs.

This is a suitable alternative to map() when there is no output to be collected from f - for example, if we are using it to print something to screen or write some data to a file.
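A minimal sketch of walk(), showing that the input passes through:

```r
library(purrr)

out <- walk(1:3, ~ cat("processing item", .x, "\n"))   # prints three lines
out   # walk() returns its input (invisibly), so out is just 1:3
```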

Exercise

Use walk() to make a version of the two-histogram plot for each of the eight regions, and save each one as a PNG image.

##### SOLUTION

makefig <- function(g){
  p <- co2_pp %>%
    filter(eight_regions==g, year %in% c(1997,2014)) %>%
    mutate(year=factor(year)) %>%
    ggplot(aes(x=log(t_pp), fill=year)) + 
      geom_histogram(bins=15, alpha=0.4, position="identity")
  ggsave(str_c(g,".png"), plot=p)   # pass the plot explicitly: inside a function it is never auto-printed
}

walk(groups, makefig)
## Saving 7 x 5 in image
## Saving 7 x 5 in image
## Saving 7 x 5 in image
## Saving 7 x 5 in image
## Saving 7 x 5 in image
## Saving 7 x 5 in image
## Saving 7 x 5 in image
## Saving 7 x 5 in image
#####

Further Reading

Textbooks

These online textbooks cover a wide range of topics in R data processing using the tidyverse.

Other web resources

The RStudio cheat sheets are great for a quick overview of the tidyverse functions.

The tidyverse overview provides links to further information on each individual package.

The RStudio team also provide some suggested learning paths for beginner, intermediate and expert R users.

Acknowledgements

All data used in this workshop are taken from https://www.gapminder.org/data/